Introduction

How much homophony should we expect for a given wordform, as a function of its phonotactics and word length? Is this expected homophony more or less than the observed homophony for that wordform? And critically: does this Homophony Delta vary across wordforms as a function of Wordform Frequency?

That is, if there is a pressure against homophony in real lexica (Trott & Bergen, 2020), that pressure might manifest unevenly across the lexicon. Based on previous work, the pressure against homophony may actually be weakest for long wordforms and for phonotactically implausible wordforms. Why would this be the case? It seems counterintuitive: a lexicon designed for efficiency should optimize the use of its wordforms by concentrating homophones among the shortest, most phonotactically plausible forms.

One explanation is that word length is correlated with frequency. A negative selection pressure might be strongest among wordforms which, if ambiguous, would require very frequent disambiguation. If this interpretation is correct, Homophony Delta should correlate negatively with Frequency.

Again, this prediction appears counterintuitive, given the extensive evidence that wordform frequency is positively correlated with ambiguity, whether measured in terms of number of homophones or number of senses (Zipf, 1945; Piantadosi et al., 2012). But this observation would ultimately still be consistent with a mechanism selecting against homophones in the most frequent wordforms. The point is not that frequent wordforms should have fewer homophones in absolute terms; rather, frequent wordforms should have fewer homophones than one would expect on account of their phonotactics and length, while this gap, or Homophony Delta, should be smaller (or even inverted) for infrequent wordforms.

Calculating expected homophony

Given \(M\) meanings to express, and \(W\) wordforms with which to express them, each associated with some phonotactic probability, how many meanings should each wordform \(w_i\) carry as a function of its normalized probability \(p_i\)?
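One simple answer, and the one implemented in the helpers below: if the \(M\) meanings of a given length are allocated to wordforms in proportion to their normalized phonotactic probabilities, the expected number of lexicon entries carried by \(w_i\) is \(M p_i\). Subtracting the wordform itself yields the expected homophone count, and the observed-minus-expected gap is the Homophony Delta:

```latex
\mathbb{E}[k_i] = M \, p_i - 1,
\qquad
\Delta_i = k_i^{\mathrm{obs}} - \mathbb{E}[k_i]
```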

Helper functions

We define several helper functions to simplify the data processing pipeline.

First, we need a function that reads in an aggregated lexicon (i.e., one row per wordform, collapsed across homophonous entries). From the homophone counts, this function computes the total number of meanings M per word length, and merges that information back into the aggregated lexicon.

read_data = function(agg_path) {
  
  # Read in aggregated data
  df_agg = read_csv(agg_path)
  
  # Count the total number of meanings (i.e., lexicon entries) M per word length.
  df_total_counts = df_agg %>%
    mutate(k_actual = num_homophones + 1) %>%
    group_by(num_sylls_est) %>%
    summarise(M = sum(k_actual))
  
  # Merge the per-length totals back into the aggregated lexicon
  df_agg = merge(df_agg, df_total_counts, by = "num_sylls_est")
  
  df_agg
}
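To make the M computation concrete, here is the same logic applied to a hypothetical three-row aggregated lexicon (toy values, not the real data):

```r
library(dplyr)

# Hypothetical aggregated lexicon: the first monosyllable has 2 homophones
# (3 entries total), the second has none (1 entry), the disyllable has 1 (2 entries).
toy = tibble(
  num_sylls_est  = c(1, 1, 2),
  num_homophones = c(2, 0, 1)
)

toy %>%
  mutate(k_actual = num_homophones + 1) %>%
  group_by(num_sylls_est) %>%
  summarise(M = sum(k_actual))
# M = 4 meanings among monosyllables, M = 2 among disyllables
```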

We then define a function to normalize the probability of a given wordform relative to the sum of probabilities of wordforms for that length.

normalize_probabilities = function(df) {
  ### For a given dataframe, normalize probabilities "p" relative to other words of the same length.
  
  ## First get probabilities
  df = df %>%
    mutate(normalized_surprisal = heldout_surprisal / num_phones,
           p = 10^heldout_log_prob)
  
  # Calculate sum(p) for each syllable length bucket.
  df_syl_sums = df %>%
    group_by(num_sylls_est) %>%
    summarise(total_prob = sum(p))
  
  # Merge with aggregated data, and normalize
  df = df %>%
    left_join(df_syl_sums, by = "num_sylls_est") %>%
    mutate(p_normalized = p / total_prob)
  
  df
}
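A minimal check of the normalization logic on made-up probabilities (a grouped mutate is an equivalent formulation of the merge-based approach above):

```r
library(dplyr)

# Hypothetical base-10 log probabilities for three wordforms.
toy = tibble(
  num_sylls_est    = c(1, 1, 2),
  heldout_log_prob = log10(c(0.02, 0.01, 0.005))
)

toy %>%
  mutate(p = 10^heldout_log_prob) %>%
  group_by(num_sylls_est) %>%
  mutate(p_normalized = p / sum(p)) %>%
  summarise(total = sum(p_normalized))
# p_normalized sums to 1 within each syllable-length bucket
```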

We now define a function to compute the expected number of homophones \(k_i\) for a given wordform \(w_i\), along with the delta between the observed and expected counts.

compute_expected_homophones = function(df) {
  
  ## Compute expected homophones "k" of a wordform by multiplying normalized probability
  ## "p" by the number of meanings "M".
  
  df = df %>%
    # Expected number of "successes" (entries), minus 1
    mutate(k = p_normalized * M - 1) %>%
    mutate(delta = num_homophones - k)
  
  df
}
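A toy check of this computation (hypothetical values): note that summing \(k + 1\) within a bucket recovers \(M\), which is the same sanity check applied to the real lexica below.

```r
library(dplyr)

# Hypothetical bucket with M = 10 meanings spread over three wordforms.
toy = tibble(
  p_normalized   = c(0.5, 0.3, 0.2),
  M              = 10,
  num_homophones = c(5, 1, 0)
)

toy = toy %>%
  mutate(k = p_normalized * M - 1,
         delta = num_homophones - k)
toy$k            # 4, 2, 1
toy$delta        # 1, -1, -1
sum(toy$k + 1)   # 10: expected entries sum back to M
```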

We also define a function to run the main regression:

run_regression = function(df) {
  
  # Initialize output
  out = list()
  
  # Build full model
  model_full = lm(data = df,
                  delta ~ frequency + num_sylls_est + normalized_surprisal)

  # Build reduced model
  model_reduced = lm(data = df,
                     delta ~ num_sylls_est + normalized_surprisal)
  
  # Run model comparison
  comparison = anova(model_reduced, model_full)
  df_comp = broom::tidy(comparison) %>%
    na.omit()
  
  # Tidy model output
  df_model = broom::tidy(model_full)
  
  # Add tidied outputs to the list
  out$model = df_model
  out$comparison = df_comp
  
  out
}
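To sanity-check the model-comparison logic on purely synthetic data (all variable values simulated here, not drawn from any lexicon): when delta truly depends on frequency, `anova()` should strongly favor the full model.

```r
set.seed(1)

# Simulate predictors and a delta that genuinely depends on frequency.
n = 500
frequency            = rnorm(n)
num_sylls_est        = sample(1:4, n, replace = TRUE)
normalized_surprisal = rnorm(n)
delta = -0.5 * frequency + 0.3 * num_sylls_est + rnorm(n)

model_full    = lm(delta ~ frequency + num_sylls_est + normalized_surprisal)
model_reduced = lm(delta ~ num_sylls_est + normalized_surprisal)

# The F-test for the dropped frequency term should be highly significant.
anova(model_reduced, model_full)
```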

And several plotting functions:

plot_outcome = function(df_output) {
  
  df_output$model$predictor = fct_recode(
    df_output$model$term,
    "Number of Syllables" = "num_sylls_est",
    "Normalized Surprisal" = "normalized_surprisal",
    "Log(Frequency)" = "frequency"
    # "Neighborhood Size" = "neighborhood_size"
  )
  
  ### Plots outcome of regression
  g = df_output$model %>%
    ggplot(aes(x = predictor,
               y = estimate)) +
    geom_point() +
    coord_flip() +
    geom_hline(yintercept = 0, linetype = "dotted") +
    geom_errorbar(aes(ymin = estimate - 2*std.error, 
                      ymax = estimate + 2*std.error), 
                  width=.2,
                  position=position_dodge(.9)) +
    labs(x = "Predictor",
         y = "Estimate") +
    theme_minimal()
  
  g
}

plot_comparison = function(df) {
  
  # Plots expected vs. actual homophone counts for each wordform.
  g = df %>%
    ggplot(aes(x = k,
               y = num_homophones)) +
    geom_point(alpha = .5) +
    scale_x_continuous(limits = c(-1, max(df$k))) +
    scale_y_continuous(limits = c(0, max(df$k))) +
    geom_abline(intercept = 0, slope = 1, linetype = "dotted") + 
    labs(x = "Expected number of homophones",
         y = "Actual number of homophones") +
    theme_minimal()
  
  g
}


plot_bins = function(df, n, column, label) {
  
  # Plots delta ~ frequency bins.
  
  df$binned = as.numeric(ntile(pull(df, column), n = n))
  
  g = df %>%
    group_by(binned) %>%
    summarise(mean_delta = mean(delta),
              se_delta = sd(delta) / sqrt(n())) %>%
    ggplot(aes(x = binned,
               y = mean_delta)) +
    geom_point(size = .2) +
    geom_hline(yintercept = 0, linetype = "dotted") +
    geom_errorbar(aes(ymin = mean_delta - se_delta, 
                      ymax = mean_delta + se_delta), 
                  width=.2,
                  position=position_dodge(.9)) +
    labs(x = label,
         y = "Delta (Real - Expected)") +
    geom_smooth() +
    theme_minimal()

    g
}


plot_contrast = function(df, bins = 20) {
  
  df$frequency_binned = as.numeric(ntile(df$frequency, n = bins))
  
  g = df %>%
    mutate(expected = k,
           actual = num_homophones) %>%
    pivot_longer(c(expected, actual), names_to = "model", values_to = "homophones") %>%
    ggplot(aes(x = frequency_binned,
               y = homophones,
               color = model)) +
    stat_summary(fun = mean,
                 fun.min = function(x) mean(x) - sd(x) / sqrt(length(x)),
                 fun.max = function(x) mean(x) + sd(x) / sqrt(length(x)),
                 geom = "pointrange") +
    labs(x = "Binned Frequency",
         y = "Number of Homophones") +
    theme_bw()
  
  g
}

Homophony Delta ~ Frequency

English

First, we load and process the data.

df_english = read_data("../data/processed/english/reals/english_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   `Unnamed: 0` = col_double(),
##   index = col_double(),
##   Word = col_character(),
##   CobLog = col_double(),
##   CompCnt = col_double(),
##   PhonDISC = col_character(),
##   Class = col_character(),
##   SylCnt = col_double(),
##   num_phones = col_double(),
##   num_sylls_est = col_double(),
##   remove = col_logical(),
##   num_homophones = col_double(),
##   log_prob = col_double(),
##   surprisal = col_double(),
##   rank_num_homophones = col_double(),
##   neighborhood_size = col_double(),
##   heldout_log_prob = col_double(),
##   heldout_surprisal = col_double()
## )
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_english)
## [1] 35107
## Now normalize probabilities and compute expected homophones per wordform
df_english = df_english %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_english %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 3
##   num_sylls_est  total correct_total
##           <dbl>  <dbl>         <dbl>
## 1             1  7706.          7706
## 2             2 15247.         15247
## 3             3 11379          11379
## 4             4  5316           5316
## 5             5  1694           1694
## 6             6   439            439
## 7             7    95             95
## 8             8    10             10
## 9            10     1              1

We then merge with frequency data.

df_subtlex = read_csv("../data/frequency/english/SUBTLEXusfrequencyabove1.csv")
## Parsed with column specification:
## cols(
##   Word = col_character(),
##   FREQcount = col_double(),
##   CDcount = col_double(),
##   FREQlow = col_double(),
##   Cdlow = col_double(),
##   SUBTLWF = col_double(),
##   Lg10WF = col_double(),
##   SUBTLCD = col_double(),
##   Lg10CD = col_double()
## )
nrow(df_subtlex)
## [1] 60384
df_subtlex$Word = tolower(df_subtlex$Word)

df_merged_english = df_english %>%
  inner_join(df_subtlex, by = "Word")
## Joining, by = "Word"
nrow(df_merged_english)
## [1] 24307
df_merged_english$frequency = df_merged_english$Lg10WF

Now we can visualize the outcome in a variety of ways:

df_merged_english %>%
  plot_comparison()

df_merged_english %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_merged_english %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 x 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -6.59     0.425     -15.5  4.55e- 54
## 2 frequency              -0.655    0.0837     -7.82 5.29e- 15
## 3 num_sylls_est           0.674    0.0791      8.52 1.71e- 17
## 4 normalized_surprisal    3.92     0.154      25.4  1.95e-140
output$comparison
## # A tibble: 1 x 6
##   res.df      rss    df sumsq statistic  p.value
##    <dbl>    <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  24303 2731590.     1 6882.      61.2 5.29e-15

Directly contrast the relationships:

df_merged_english %>%
  plot_contrast(bins = 20)

Calculate Sense Entropy

df_english_lemmas = read_delim("../data/frequency/english/lemmas.csv", delim = "\\",
                               quote = "")
## Parsed with column specification:
## cols(
##   PhonDISC = col_character(),
##   Cob = col_double()
## )
nrow(df_english_lemmas)
## [1] 52447
df_english_lemmas = df_english_lemmas %>%
  mutate(freq_adjusted = Cob + 1) %>%
  group_by(PhonDISC) %>%
  mutate(total_frequency = sum(freq_adjusted),
         relative_frequency = freq_adjusted / total_frequency)

df_entropy = df_english_lemmas %>%
  group_by(PhonDISC) %>%
  summarise(entropy = -sum(relative_frequency * log(relative_frequency)),
            total_frequency = mean(total_frequency))
## `summarise()` ungrouping output (override with `.groups` argument)
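As a quick sanity check on the entropy measure (hypothetical relative frequencies): two equally frequent lemmas give entropy log(2), while a single dominant lemma drives entropy toward zero.

```r
# Two equally frequent senses:
rel = c(0.5, 0.5)
-sum(rel * log(rel))  # log(2), about 0.693

# One dominant sense:
rel = c(0.99, 0.01)
-sum(rel * log(rel))  # about 0.056
```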
df_merged_english = df_merged_english %>%
  inner_join(df_entropy, by = "PhonDISC")

nrow(filter(df_merged_english, num_homophones == 1))
## [1] 3550
model_full = lm(data = filter(df_merged_english, num_homophones == 1),
                delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)

summary(model_full)
## 
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal, 
##     data = filter(df_merged_english, num_homophones == 1))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -245.772   -0.664    1.137    2.715    7.410 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -7.8108     1.0916  -7.155 1.01e-12 ***
## entropy                1.1216     0.7343   1.527 0.126760    
## frequency             -1.1569     0.1831  -6.318 2.99e-10 ***
## num_sylls_est          0.7758     0.2330   3.330 0.000878 ***
## normalized_surprisal   5.1642     0.3501  14.750  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.384 on 3545 degrees of freedom
## Multiple R-squared:  0.07563,    Adjusted R-squared:  0.07458 
## F-statistic: 72.51 on 4 and 3545 DF,  p-value: < 2.2e-16
model_reduced = lm(data = filter(df_merged_english, num_homophones == 1),
                   delta ~ frequency + num_sylls_est + normalized_surprisal)

anova(model_reduced, model_full)
## Analysis of Variance Table
## 
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   3546 312343                           
## 2   3545 312138  1    205.41 2.3328 0.1268

Dutch

First, we load and process the data.

df_dutch = read_data("../data/processed/dutch/reals/dutch_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Word = col_character(),
##   PhonStrsDISC = col_character(),
##   PhonCVBr = col_character(),
##   PhonSylBCLX = col_character(),
##   PhonStrsStDISC = col_character(),
##   PhonStCVBr = col_character(),
##   PhonSylStBCLX = col_character(),
##   PhonolCLX = col_character(),
##   PhonolCPA = col_character(),
##   PhonDISC = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_dutch)
## [1] 65260
## Now normalize probabilities and compute expected homophones per wordform
df_dutch = df_dutch %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_dutch %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 3
##   num_sylls_est  total correct_total
##           <dbl>  <dbl>         <dbl>
## 1             1  4672           4672
## 2             2 20588          20588
## 3             3 26420          26420
## 4             4 11338          11338
## 5             5  3394           3394
## 6             6   846.           846
## 7             7   189.           189
## 8             8    28             28
## 9             9     2              2

We then merge with frequency data.

df_subtlex = read_delim("../data/frequency/dutch/SUBTLEX-NL.txt", delim="\t")
## Parsed with column specification:
## cols(
##   Word = col_character(),
##   FREQcount = col_double(),
##   CDcount = col_double(),
##   FREQlow = col_double(),
##   CDlow = col_double(),
##   FREQlemma = col_double(),
##   SUBTLEXWF = col_double(),
##   Lg10WF = col_double(),
##   SUBTLEXCD = col_double(),
##   Lg10CD = col_double()
## )
nrow(df_subtlex)
## [1] 437503
df_subtlex$Word = tolower(df_subtlex$Word)

df_merged_dutch = df_dutch %>%
  inner_join(df_subtlex, by = "Word")
## Joining, by = "Word"
nrow(df_merged_dutch)
## [1] 28330
df_merged_dutch$frequency = df_merged_dutch$Lg10WF

Now we can visualize the outcome in a variety of ways:

df_merged_dutch %>%
  plot_comparison()

df_merged_dutch %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_merged_dutch %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 x 5
##   term                 estimate std.error statistic  p.value
##   <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)            -6.64      0.834     -7.96 1.73e-15
## 2 frequency              -3.38      0.170    -19.8  4.90e-87
## 3 num_sylls_est           0.289     0.164      1.77 7.75e- 2
## 4 normalized_surprisal    6.58      0.328     20.1  7.51e-89
output$comparison
## # A tibble: 1 x 6
##   res.df       rss    df   sumsq statistic  p.value
##    <dbl>     <dbl> <dbl>   <dbl>     <dbl>    <dbl>
## 1  28326 15675501.     1 217909.      394. 4.90e-87

Directly contrast the relationships:

df_merged_dutch %>%
  plot_contrast(bins = 20)

Calculate Sense Entropy

df_dutch_lemmas = read_delim("../data/frequency/dutch/lemmas.csv", delim = "\\")
## Parsed with column specification:
## cols(
##   PhonDISC = col_character(),
##   Inl = col_double()
## )
nrow(df_dutch_lemmas)
## [1] 123926
df_dutch_lemmas = df_dutch_lemmas %>%
  mutate(freq_adjusted = Inl + 1) %>%
  group_by(PhonDISC) %>%
  mutate(total_frequency = sum(freq_adjusted),
         relative_frequency = freq_adjusted / total_frequency)

df_entropy = df_dutch_lemmas %>%
  group_by(PhonDISC) %>%
  summarise(entropy = -sum(relative_frequency * log(relative_frequency)))
## `summarise()` ungrouping output (override with `.groups` argument)
df_merged_dutch = df_merged_dutch %>%
  inner_join(df_entropy, by = "PhonDISC")
nrow(df_merged_dutch)
## [1] 27615
nrow(filter(df_merged_dutch, num_homophones == 1))
## [1] 1258
model_full = lm(data = filter(df_merged_dutch, num_homophones == 1),
                delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)

summary(model_full)
## 
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal, 
##     data = filter(df_merged_dutch, num_homophones == 1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -560.62   -2.43    2.57    6.89   20.04 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -6.3964     3.7514  -1.705   0.0884 .  
## entropy                0.6487     2.8147   0.230   0.8178    
## frequency             -4.8754     0.6750  -7.223 8.83e-13 ***
## num_sylls_est         -1.1844     0.9042  -1.310   0.1905    
## normalized_surprisal   9.7822     1.2843   7.617 5.11e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.6 on 1253 degrees of freedom
## Multiple R-squared:  0.0873, Adjusted R-squared:  0.08438 
## F-statistic: 29.96 on 4 and 1253 DF,  p-value: < 2.2e-16
model_reduced = lm(data = filter(df_merged_dutch, num_homophones == 1),
                   delta ~ frequency + num_sylls_est + normalized_surprisal)

anova(model_reduced, model_full)
## Analysis of Variance Table
## 
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   1254 758586                           
## 2   1253 758554  1    32.161 0.0531 0.8178

German

df_german = read_data("../data/processed/german/reals/german_with_mps_5phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   `Unnamed: 0` = col_double(),
##   index = col_double(),
##   Word = col_character(),
##   CompCnt = col_double(),
##   PhonDISC = col_character(),
##   SylCnt = col_double(),
##   num_phones = col_double(),
##   num_sylls_est = col_double(),
##   remove = col_logical(),
##   num_homophones = col_double(),
##   log_prob = col_double(),
##   surprisal = col_double(),
##   rank_num_homophones = col_double(),
##   neighborhood_size = col_double(),
##   heldout_log_prob = col_double(),
##   heldout_surprisal = col_double()
## )
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_german)
## [1] 50435
## Now normalize probabilities and compute expected homophones per wordform
df_german = df_german %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_german %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 10 x 3
##    num_sylls_est total correct_total
##            <dbl> <dbl>         <dbl>
##  1             1  2454          2454
##  2             2 13181         13181
##  3             3 19429         19429
##  4             4 10983         10983
##  5             5  3971          3971
##  6             6  1240          1240
##  7             7   328           328
##  8             8    72            72
##  9             9    16            16
## 10            10     2             2

We then merge with frequency data.

df_freq = read_csv("../data/frequency/german/SUBTLEX-DE.csv")
## Parsed with column specification:
## cols(
##   Word = col_character(),
##   `Spell check` = col_double(),
##   FREQcount = col_double()
## )
nrow(df_freq)
## [1] 377524
# df_freq$Word = tolower(df_freq$Word)

# Compute log10 frequency (+ 1). Should be comparable to the measure used for English.
df_freq$Lg10WF = log10(df_freq$FREQcount + 1)

df_merged_german = df_german %>%
  inner_join(df_freq, by = "Word")
## Joining, by = "Word"
nrow(df_merged_german)
## [1] 24350
df_merged_german$frequency = df_merged_german$Lg10WF

Now we can visualize the outcome in a variety of ways:

df_merged_german %>%
  plot_comparison()

df_merged_german %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_merged_german %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 x 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -4.76     0.399     -12.0  7.99e- 33
## 2 frequency              -1.67     0.0888    -18.8  3.12e- 78
## 3 num_sylls_est           0.316    0.0731      4.32 1.55e-  5
## 4 normalized_surprisal    4.19     0.157      26.7  2.36e-154
output$comparison
## # A tibble: 1 x 6
##   res.df      rss    df  sumsq statistic  p.value
##    <dbl>    <dbl> <dbl>  <dbl>     <dbl>    <dbl>
## 1  24346 2750831.     1 39903.      353. 3.12e-78

Directly contrast the relationships:

df_merged_german %>%
  plot_contrast(bins = 20)

Calculate Sense Entropy

df_german_lemmas = read_delim("../data/frequency/german/lemmas.csv", delim = "\\")
## Parsed with column specification:
## cols(
##   PhonDISC = col_character(),
##   Mann = col_double()
## )
nrow(df_german_lemmas)
## [1] 51728
df_german_lemmas = df_german_lemmas %>%
  mutate(freq_adjusted = Mann + 1) %>%
  group_by(PhonDISC) %>%
  mutate(total_frequency = sum(freq_adjusted),
         relative_frequency = freq_adjusted / total_frequency)

df_entropy = df_german_lemmas %>%
  group_by(PhonDISC) %>%
  summarise(entropy = -sum(relative_frequency * log(relative_frequency)))
## `summarise()` ungrouping output (override with `.groups` argument)
df_merged_german = df_merged_german %>%
  inner_join(df_entropy, by = "PhonDISC")
nrow(df_merged_german)
## [1] 23812
nrow(filter(df_merged_german, num_homophones == 1))
## [1] 740
model_full = lm(data = filter(df_merged_german, num_homophones == 1),
                delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)

summary(model_full)
## 
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal, 
##     data = filter(df_merged_german, num_homophones == 1))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -257.105   -0.305    3.717    6.664   13.163 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -13.8831     4.2261  -3.285  0.00107 ** 
## entropy                0.6983     3.2098   0.218  0.82785    
## frequency             -2.0243     0.6702  -3.021  0.00261 ** 
## num_sylls_est          0.7451     0.9702   0.768  0.44270    
## normalized_surprisal   9.0383     1.3466   6.712 3.84e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 19.63 on 735 degrees of freedom
## Multiple R-squared:  0.07683,    Adjusted R-squared:  0.07181 
## F-statistic: 15.29 on 4 and 735 DF,  p-value: 5.087e-12
model_reduced = lm(data = filter(df_merged_german, num_homophones == 1),
                   delta ~ frequency + num_sylls_est + normalized_surprisal)

anova(model_reduced, model_full)
## Analysis of Variance Table
## 
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    736 283102                           
## 2    735 283084  1    18.226 0.0473 0.8278

French

df_french = read_data("../data/processed/french/reals/french_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `1_ortho` = col_character(),
##   `2_phon` = col_character(),
##   `3_lemme` = col_character(),
##   `4_cgram` = col_character(),
##   `5_genre` = col_character(),
##   `6_nombre` = col_character(),
##   `11_infover` = col_character(),
##   `17_cvcv` = col_character(),
##   `18_p_cvcv` = col_character(),
##   `23_syll` = col_character(),
##   `25_cv-cv` = col_character(),
##   `26_orthrenv` = col_character(),
##   `27_phonrenv` = col_character(),
##   `28_orthosyll` = col_character(),
##   `29_cgramortho` = col_character(),
##   `34_morphoder` = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_french)
## [1] 37278
## Now normalize probabilities and compute expected homophones per wordform
df_french = df_french %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_french %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 9 x 3
##   num_sylls_est total correct_total
##           <dbl> <dbl>         <dbl>
## 1             1  3893          3893
## 2             2 13794         13794
## 3             3 15258         15258
## 4             4  7589          7589
## 5             5  2466          2466
## 6             6   670           670
## 7             7    97            97
## 8             8    12            12
## 9             9     3             3

French already includes a frequency measure, so we simply compute its log transform.

df_french$frequency = log10(df_french$`10_freqlivres` + 1)

Now we can visualize the outcome in a variety of ways:

df_french %>%
  plot_comparison()

df_french %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_french %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 x 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -5.82     0.262     -22.2  3.01e-108
## 2 frequency              -0.940    0.104      -9.00 2.41e- 19
## 3 num_sylls_est           0.580    0.0474     12.2  2.38e- 34
## 4 normalized_surprisal    3.19     0.109      29.3  5.81e-187
output$comparison
## # A tibble: 1 x 6
##   res.df      rss    df sumsq statistic  p.value
##    <dbl>    <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  37274 2687005.     1 5836.      81.0 2.41e-19

Directly contrast the relationships:

df_french %>%
  plot_contrast(bins = 20)

Japanese

df_japanese = read_data("../data/processed/japanese/reals/japanese_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   orth_form_kanji = col_character(),
##   orth_form_hiragana = col_character(),
##   orth_form_romaji = col_character(),
##   phonetic_form = col_character(),
##   morph_form = col_character(),
##   glosses = col_character(),
##   multiple_pronunications = col_logical(),
##   phonetic_remapped = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_japanese)
## [1] 40449
## Now normalize probabilities and compute expected homophones per wordform
df_japanese = df_japanese %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
## Double-check that this equals correct amount in lexicon
df_japanese %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 17 x 3
##    num_sylls_est    total correct_total
##            <dbl>    <dbl>         <dbl>
##  1             1  1011             1011
##  2             2  7710.            7710
##  3             3 14816            14816
##  4             4 14237            14237
##  5             5  6619             6619
##  6             6  3673             3673
##  7             7  1657             1657
##  8             8   733              733
##  9             9   345              345
## 10            10   185              185
## 11            11    72               72
## 12            12    52               52
## 13            13    19               19
## 14            14     8                8
## 15            15     7.00             7
## 16            16     2                2
## 17            22     1                1

Japanese already has frequency data, but we need to make sure it's summed across lemmas.

df_japanese_lemmas = read_csv("../data/processed/japanese/reals/japanese_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   `Unnamed: 0` = col_double(),
##   orth_form_kanji = col_character(),
##   orth_form_hiragana = col_character(),
##   orth_form_romaji = col_character(),
##   phonetic_form = col_character(),
##   morph_form = col_character(),
##   frequency = col_double(),
##   glosses = col_character(),
##   multiple_pronunications = col_logical(),
##   phonetic_remapped = col_character(),
##   num_phones = col_double(),
##   num_sylls_est = col_double(),
##   remove = col_logical(),
##   num_homophones = col_double(),
##   log_prob = col_double(),
##   surprisal = col_double()
## )
nrow(df_japanese_lemmas)
## [1] 51147
df_japanese_lemmas = df_japanese_lemmas %>%
  group_by(phonetic_remapped) %>%
  summarise(total_frequency = sum(frequency))
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_japanese_lemmas)
## [1] 40449
df_japanese_merged = df_japanese %>%
  inner_join(df_japanese_lemmas, by = "phonetic_remapped")

nrow(df_japanese)
## [1] 40449
nrow(df_japanese_merged)
## [1] 40449
df_japanese_merged$frequency = log10(df_japanese_merged$total_frequency + 1)

Now, we visualize the relationship:

df_japanese_merged %>%
  plot_comparison()

df_japanese_merged %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_japanese_merged %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 x 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -3.79     0.144     -26.3  3.12e-151
## 2 frequency              -1.68     0.0896    -18.8  3.48e- 78
## 3 num_sylls_est           0.116    0.0183      6.35 2.19e- 10
## 4 normalized_surprisal    2.95     0.0851     34.7  1.48e-259
output$comparison
## # A tibble: 1 x 6
##   res.df      rss    df  sumsq statistic  p.value
##    <dbl>    <dbl> <dbl>  <dbl>     <dbl>    <dbl>
## 1  40445 1203990.     1 10476.      352. 3.48e-78
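`run_regression()` is defined earlier in the notebook (outside this excerpt). Judging from the comparison table above, which tests the `frequency` term, it appears to fit a full model and a reduced model dropping `frequency`, then compare them with `anova()`, analogous to `run_regression_with_neighborhood_size` defined later. A hypothetical stand-alone sketch (the function name and exact internals are assumptions):

```r
# Sketch only: an assumed reconstruction of run_regression(), not the
# notebook's actual definition. Fits full and reduced models of delta,
# then tidies the coefficient table and the anova() comparison.
run_regression_sketch = function(df) {
  model_full = lm(delta ~ frequency + num_sylls_est + normalized_surprisal,
                  data = df)
  model_reduced = lm(delta ~ num_sylls_est + normalized_surprisal,
                     data = df)
  list(model = broom::tidy(model_full),
       comparison = na.omit(broom::tidy(anova(model_reduced, model_full))))
}
```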

Directly contrast the relationships:

df_japanese_merged %>%
  plot_contrast(bins = 20)

Calculate Sense Entropy
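Concretely, sense entropy for a wordform \(w\) is the Shannon entropy over the add-one-smoothed relative frequencies of the lemmas sharing its phonetic form, using the natural log (as in the code below):

\[
H(w) = -\sum_{j=1}^{n_w} p_j \log p_j, \qquad p_j = \frac{f_j + 1}{\sum_{k=1}^{n_w} (f_k + 1)}
\]

where \(f_j\) is the raw frequency of the \(j\)-th of the \(n_w\) lemmas sharing wordform \(w\).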

df_japanese_lemmas = read_csv("../data/processed/japanese/reals/japanese_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   `Unnamed: 0` = col_double(),
##   orth_form_kanji = col_character(),
##   orth_form_hiragana = col_character(),
##   orth_form_romaji = col_character(),
##   phonetic_form = col_character(),
##   morph_form = col_character(),
##   frequency = col_double(),
##   glosses = col_character(),
##   multiple_pronunications = col_logical(),
##   phonetic_remapped = col_character(),
##   num_phones = col_double(),
##   num_sylls_est = col_double(),
##   remove = col_logical(),
##   num_homophones = col_double(),
##   log_prob = col_double(),
##   surprisal = col_double()
## )
nrow(df_japanese_lemmas)
## [1] 51147
df_japanese_lemmas = df_japanese_lemmas %>%
  mutate(freq_adjusted = frequency + 1) %>%
  # mutate(freq_adjusted = log10(frequency + 1) + .01) %>%
  group_by(phonetic_remapped) %>%
  mutate(total_frequency = sum(freq_adjusted),
         relative_frequency = freq_adjusted / total_frequency)

df_entropy = df_japanese_lemmas %>%
  group_by(phonetic_remapped) %>%
  summarise(entropy = -sum(relative_frequency * log(relative_frequency)))
## `summarise()` ungrouping output (override with `.groups` argument)
df_japanese_entropy = df_japanese_merged %>%
  inner_join(df_entropy, by = "phonetic_remapped")
nrow(df_japanese_entropy)
## [1] 40449
nrow(filter(df_japanese_entropy, num_homophones == 1))
## [1] 4034
model_full = lm(data = filter(df_japanese_entropy, num_homophones == 1),
                delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)

summary(model_full)
## 
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal, 
##     data = filter(df_japanese_entropy, num_homophones == 1))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -229.639   -0.132    1.238    2.042    5.530 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -5.75904    1.18277  -4.869 1.16e-06 ***
## entropy               2.35183    1.48902   1.579   0.1143    
## frequency            -0.89102    0.37742  -2.361   0.0183 *  
## num_sylls_est         0.07058    0.11106   0.636   0.5251    
## normalized_surprisal  3.42456    0.32391  10.573  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.442 on 4029 degrees of freedom
## Multiple R-squared:  0.03562,    Adjusted R-squared:  0.03466 
## F-statistic:  37.2 on 4 and 4029 DF,  p-value: < 2.2e-16
model_reduced = lm(data = filter(df_japanese_entropy, num_homophones == 1),
                   delta ~ frequency + num_sylls_est + normalized_surprisal)

anova(model_reduced, model_full)
## Analysis of Variance Table
## 
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   4030 167317                           
## 2   4029 167214  1    103.53 2.4946 0.1143

Mandarin: CallHome

df_mandarin = read_data("../data/processed/mandarin/reals/mandarin_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Key = col_character(),
##   `PY+T` = col_character(),
##   `Pho+T` = col_character(),
##   `IPA+T` = col_character(),
##   `PY-T` = col_character(),
##   `Pho-T` = col_character(),
##   `IPA-T` = col_character(),
##   Tone = col_character(),
##   SyStruct = col_character(),
##   Words = col_character(),
##   `Dominate-POS` = col_character(),
##   `Other-POSes` = col_character(),
##   Neighbors = col_character(),
##   file = col_character(),
##   homophonous_words = col_character(),
##   word = col_character(),
##   glides_remapped = col_character(),
##   phonetic_remapped = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_mandarin)
## [1] 45871
## Now normalize probabilities and compute expected homophones per wordform
df_mandarin = df_mandarin %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
nrow(df_mandarin)
## [1] 45871
## Double-check that this matches the correct total in the lexicon
df_mandarin %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 3
##    num_sylls_est total correct_total
##            <dbl> <dbl>         <dbl>
##  1             1  4658          4658
##  2             2 32372         32372
##  3             3  9186          9186
##  4             4  5336          5336
##  5             5   312           312
##  6             6    97            97
##  7             7    56            56
##  8             8    17            17
##  9             9     3             3
## 10            10     6             6
## 11            11     3             3
## 12            14     1             1

Mandarin already has frequency data, but we need to make sure it's summed across lemmas.

df_mandarin_lemmas = read_csv("../data/processed/mandarin/reals/mandarin_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Key = col_character(),
##   `PY+T` = col_character(),
##   `Pho+T` = col_character(),
##   `IPA+T` = col_character(),
##   `PY-T` = col_character(),
##   `Pho-T` = col_character(),
##   `IPA-T` = col_character(),
##   Tone = col_character(),
##   SyStruct = col_character(),
##   Words = col_character(),
##   `Dominate-POS` = col_character(),
##   `Other-POSes` = col_character(),
##   Neighbors = col_character(),
##   file = col_character(),
##   homophonous_words = col_character(),
##   word = col_character(),
##   glides_remapped = col_character(),
##   phonetic_remapped = col_character(),
##   remove = col_logical()
## )
## See spec(...) for full column specifications.
df_mandarin_lemmas = df_mandarin_lemmas %>%
  mutate(frequency = FreqPM * 1000000) %>%
  group_by(phonetic_remapped) %>%
  summarise(total_frequency = sum(frequency))
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_mandarin_lemmas)
## [1] 45871
df_mandarin_merged = df_mandarin %>%
  inner_join(df_mandarin_lemmas, by = "phonetic_remapped")

nrow(df_mandarin)
## [1] 45871
nrow(df_mandarin_merged)
## [1] 45871
df_mandarin_merged$frequency = log10(df_mandarin_merged$total_frequency + 1)

Now we can visualize the outcome in a variety of ways:

df_mandarin_merged %>%
  plot_comparison()

df_mandarin_merged %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_mandarin_merged %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 x 5
##   term                 estimate std.error statistic  p.value
##   <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)           -3.77      0.325     -11.6  4.16e-31
## 2 frequency             -0.284     0.0548     -5.19 2.10e- 7
## 3 num_sylls_est          0.0996    0.0625      1.59 1.11e- 1
## 4 normalized_surprisal   4.37      0.256      17.1  2.67e-65
output$comparison
## # A tibble: 1 x 6
##   res.df      rss    df sumsq statistic     p.value
##    <dbl>    <dbl> <dbl> <dbl>     <dbl>       <dbl>
## 1  45867 4453802.     1 2617.      26.9 0.000000210

Directly contrast the relationships:

df_mandarin_merged %>%
  plot_contrast(bins = 20)

Mandarin: Chinese Lexical Database

Here, we calculate the Homophony Delta for Mandarin Chinese, using the Chinese Lexical Database.

df_mandarin_cld = read_data("../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Word = col_character(),
##   C1 = col_character(),
##   C2 = col_character(),
##   C3 = col_logical(),
##   C4 = col_logical(),
##   C1Structure = col_character(),
##   C2Structure = col_character(),
##   C3Structure = col_logical(),
##   C4Structure = col_logical(),
##   C1Type = col_character(),
##   C2Type = col_character(),
##   C3Type = col_logical(),
##   C4Type = col_logical(),
##   Pinyin = col_character(),
##   C1Pinyin = col_character(),
##   C2Pinyin = col_character(),
##   C3Pinyin = col_logical(),
##   C4Pinyin = col_logical(),
##   IPA = col_character(),
##   C1IPA = col_character()
##   # ... with 125 more columns
## )
## See spec(...) for full column specifications.
## Warning: 712019 parsing failures.
##   row         col           expected actual                                                                            file
## 30786 C3          1/0/T/F/TRUE/FALSE  么    '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3Structure 1/0/T/F/TRUE/FALSE  SG    '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3Type      1/0/T/F/TRUE/FALSE  Other '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3Pinyin    1/0/T/F/TRUE/FALSE  me5   '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## 30786 C3IPA       1/0/T/F/TRUE/FALSE  mɤ5   '../data/processed/mandarin_cld/reals/mandarin_cld_with_mps_4phone_holdout.csv'
## ..... ........... .................. ...... ...............................................................................
## See problems(...) for more details.
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_mandarin_cld)
## [1] 41009
## Now normalize probabilities and compute expected homophones per wordform
df_mandarin_cld = df_mandarin_cld %>% 
  normalize_probabilities() %>%
  compute_expected_homophones()
## `summarise()` ungrouping output (override with `.groups` argument)
## Joining, by = "num_sylls_est"
nrow(df_mandarin_cld)
## [1] 41009
## Double-check that this matches the correct total in the lexicon
df_mandarin_cld %>%
  group_by(num_sylls_est) %>%
  summarise(total = sum(k + 1),
            correct_total = mean(M))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 4 x 3
##   num_sylls_est  total correct_total
##           <dbl>  <dbl>         <dbl>
## 1             1  3648           3648
## 2             2 31637.         31637
## 3             3  6930           6930
## 4             4  3337           3337

Mandarin already has frequency data, but we need to make sure it's summed across lemmas.

df_mandarin_lemmas = read_csv("../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Word = col_character(),
##   C1 = col_character(),
##   C2 = col_character(),
##   C3 = col_logical(),
##   C4 = col_logical(),
##   C1Structure = col_character(),
##   C2Structure = col_character(),
##   C3Structure = col_logical(),
##   C4Structure = col_logical(),
##   C1Type = col_character(),
##   C2Type = col_character(),
##   C3Type = col_logical(),
##   C4Type = col_logical(),
##   Pinyin = col_character(),
##   C1Pinyin = col_character(),
##   C2Pinyin = col_character(),
##   C3Pinyin = col_logical(),
##   C4Pinyin = col_logical(),
##   IPA = col_character(),
##   C1IPA = col_character()
##   # ... with 125 more columns
## )
## See spec(...) for full column specifications.
## Warning: 715162 parsing failures.
##   row         col           expected actual                                                                     file
## 35286 C3          1/0/T/F/TRUE/FALSE  么    '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Structure 1/0/T/F/TRUE/FALSE  SG    '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Type      1/0/T/F/TRUE/FALSE  Other '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Pinyin    1/0/T/F/TRUE/FALSE  me5   '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3IPA       1/0/T/F/TRUE/FALSE  mɤ5   '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## ..... ........... .................. ...... ........................................................................
## See problems(...) for more details.
nrow(df_mandarin_lemmas)
## [1] 45552
df_mandarin_lemmas = df_mandarin_lemmas %>%
  group_by(phonetic_remapped) %>%
  summarise(total_frequency = sum(FrequencyRaw))
## `summarise()` ungrouping output (override with `.groups` argument)
nrow(df_mandarin_lemmas)
## [1] 41009
df_mandarin_merged = df_mandarin_cld %>%
  inner_join(df_mandarin_lemmas, by = "phonetic_remapped")

nrow(df_mandarin)
## [1] 45871
nrow(df_mandarin_merged)
## [1] 41009
df_mandarin_merged$frequency = log10(df_mandarin_merged$total_frequency)

Now we can visualize the outcome in a variety of ways:

df_mandarin_merged %>%
  plot_comparison()

df_mandarin_merged %>%
  plot_bins(n = 20, column = "frequency", label = "Binned frequency")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And finally, we can run the regression and visualize model coefficients:

output = df_mandarin_merged %>% 
  run_regression()

output %>%
  plot_outcome()

output$model
## # A tibble: 4 x 5
##   term                 estimate std.error statistic  p.value
##   <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)          -1.85       0.0987   -18.7   5.10e-78
## 2 frequency            -0.275      0.0156   -17.6   8.29e-69
## 3 num_sylls_est         0.00764    0.0221     0.345 7.30e- 1
## 4 normalized_surprisal  2.71       0.0683    39.7   0.
output$comparison
## # A tibble: 1 x 6
##   res.df     rss    df sumsq statistic  p.value
##    <dbl>   <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  41005 303218.     1 2281.      309. 8.29e-69

Directly contrast the relationships:

df_mandarin_merged %>%
  plot_contrast(bins = 20)

Calculate Sense Entropy

df_mandarin_lemmas = read_csv("../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv")
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Word = col_character(),
##   C1 = col_character(),
##   C2 = col_character(),
##   C3 = col_logical(),
##   C4 = col_logical(),
##   C1Structure = col_character(),
##   C2Structure = col_character(),
##   C3Structure = col_logical(),
##   C4Structure = col_logical(),
##   C1Type = col_character(),
##   C2Type = col_character(),
##   C3Type = col_logical(),
##   C4Type = col_logical(),
##   Pinyin = col_character(),
##   C1Pinyin = col_character(),
##   C2Pinyin = col_character(),
##   C3Pinyin = col_logical(),
##   C4Pinyin = col_logical(),
##   IPA = col_character(),
##   C1IPA = col_character()
##   # ... with 125 more columns
## )
## See spec(...) for full column specifications.
## Warning: 715162 parsing failures.
##   row         col           expected actual                                                                     file
## 35286 C3          1/0/T/F/TRUE/FALSE  么    '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Structure 1/0/T/F/TRUE/FALSE  SG    '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Type      1/0/T/F/TRUE/FALSE  Other '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3Pinyin    1/0/T/F/TRUE/FALSE  me5   '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## 35286 C3IPA       1/0/T/F/TRUE/FALSE  mɤ5   '../data/processed/mandarin_cld/reals/mandarin_cld_all_reals_4phone.csv'
## ..... ........... .................. ...... ........................................................................
## See problems(...) for more details.
nrow(df_mandarin_lemmas)
## [1] 45552
df_mandarin_lemmas = df_mandarin_lemmas %>%
  mutate(freq_adjusted = FrequencyRaw + 1) %>%
  # mutate(freq_adjusted = log10(frequency + 1) + .01) %>%
  group_by(phonetic_remapped) %>%
  mutate(total_frequency = sum(freq_adjusted),
         relative_frequency = freq_adjusted / total_frequency)

df_entropy = df_mandarin_lemmas %>%
  group_by(phonetic_remapped) %>%
  summarise(entropy = -sum(relative_frequency * log(relative_frequency)))
## `summarise()` ungrouping output (override with `.groups` argument)
df_mandarin_entropy = df_mandarin_merged %>%
  inner_join(df_entropy, by = "phonetic_remapped")
nrow(df_mandarin_entropy)
## [1] 41009
nrow(filter(df_mandarin_entropy, num_homophones == 1))
## [1] 1871
model_full = lm(data = filter(df_mandarin_entropy, num_homophones == 1),
                delta ~ entropy + frequency + num_sylls_est + normalized_surprisal)

summary(model_full)
## 
## Call:
## lm(formula = delta ~ entropy + frequency + num_sylls_est + normalized_surprisal, 
##     data = filter(df_mandarin_entropy, num_homophones == 1))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -40.710  -0.380   0.886   1.827   4.344 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -2.7619     0.9007  -3.067  0.00220 ** 
## entropy               -1.3703     0.4268  -3.211  0.00135 ** 
## frequency             -0.7130     0.1257  -5.670 1.65e-08 ***
## num_sylls_est          0.5328     0.2405   2.216  0.02683 *  
## normalized_surprisal   4.4816     0.4227  10.601  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.655 on 1866 degrees of freedom
## Multiple R-squared:  0.07813,    Adjusted R-squared:  0.07615 
## F-statistic: 39.54 on 4 and 1866 DF,  p-value: < 2.2e-16
model_reduced = lm(data = filter(df_mandarin_entropy, num_homophones == 1),
                   delta ~ frequency + num_sylls_est + normalized_surprisal)

anova(model_reduced, model_full)
## Analysis of Variance Table
## 
## Model 1: delta ~ frequency + num_sylls_est + normalized_surprisal
## Model 2: delta ~ entropy + frequency + num_sylls_est + normalized_surprisal
##   Res.Df   RSS Df Sum of Sq      F   Pr(>F)   
## 1   1867 25067                                
## 2   1866 24929  1    137.71 10.308 0.001347 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Visualizations

Combine lexica

Below, we combine the lexica so that we can visualize the central relationships in the same plot.

df_merged_english_f = df_merged_english %>%
  mutate(binned_frequency = ntile(frequency, n = 20)) %>%
  mutate(binned_neighorhood_size = ntile(neighborhood_size, n = 20)) %>%
  mutate(language = 'English') %>%
  select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
         num_homophones, k, frequency, binned_frequency, delta, language)

df_merged_dutch_f = df_merged_dutch %>%
  mutate(binned_frequency = ntile(frequency, n = 20)) %>%
  mutate(binned_neighorhood_size = ntile(neighborhood_size, n = 20)) %>%
  mutate(language = 'Dutch') %>%
  select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
         num_homophones, k, frequency, binned_frequency, delta, language)

df_merged_german_f = df_merged_german %>%
  mutate(binned_frequency = ntile(frequency, n = 20)) %>%
  mutate(binned_neighorhood_size = ntile(neighborhood_size, n = 20)) %>%
  mutate(language = 'German') %>%
  select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
         num_homophones, k, frequency, binned_frequency, delta, language)

df_french_f = df_french %>%
  mutate(binned_frequency = ntile(frequency, n = 20)) %>%
  mutate(binned_neighorhood_size = ntile(neighborhood_size, n = 20)) %>%
  mutate(language = 'French') %>%
  select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
         num_homophones, k, frequency, binned_frequency, delta, language)

df_japanese_f = df_japanese_merged %>%
  mutate(binned_frequency = ntile(frequency, n = 20)) %>%
  mutate(binned_neighorhood_size = ntile(neighborhood_size, n = 20)) %>%
  mutate(language = 'Japanese') %>%
  select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
         num_homophones, k, frequency, binned_frequency, delta, language)

df_mandarin_cld_f = df_mandarin_merged %>%
  mutate(binned_frequency = ntile(frequency, n = 20)) %>%
  mutate(binned_neighorhood_size = ntile(neighborhood_size, n = 20)) %>%
  mutate(language = 'Mandarin') %>%
  select(num_sylls_est, normalized_surprisal, binned_neighorhood_size,
         num_homophones, k, frequency, binned_frequency, delta, language)


df_all_lexica = df_merged_english_f %>%
  rbind(df_merged_dutch_f) %>%
  rbind(df_merged_german_f) %>%
  rbind(df_french_f) %>%
  rbind(df_japanese_f) %>%
  rbind(df_mandarin_cld_f) 

Visualize relationships

PlotTheme = theme(
  axis.title.x = element_text(size = 16),
  axis.text.x = element_text(size = 14),
  axis.text.y = element_text(size = 14),
  axis.title.y = element_text(size = 16),
  strip.text.x = element_text(size = 16),
  title = element_text(size = 16),
  legend.text = element_text(size = 16),
  legend.title = element_text(size = 16))

df_all_lexica %>%
  group_by(binned_frequency, language) %>%
  summarise(mean_delta = mean(delta),
            se_delta = sd(delta) / sqrt(n())) %>%
  ggplot(aes(x = binned_frequency,
             y = mean_delta)) +
  geom_point(stat = "identity", size = .2) +
  geom_hline(yintercept = 0, linetype = "dotted") +
  geom_errorbar(aes(ymin = mean_delta - 2 * se_delta, 
                    ymax = mean_delta + 2 *se_delta), 
                width=.2,
                position=position_dodge(.9)) +
  labs(x = "Binned Frequency",
       y = "Delta (Real - Expected)") +
  geom_smooth() +
  theme_bw() +
  facet_wrap(~language) +
  PlotTheme
## `summarise()` regrouping output by 'binned_frequency' (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggsave("../Figures/delta.png", dpi = 300)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_all_lexica %>%
  mutate(Baseline = k,
         Real = num_homophones) %>%
  pivot_longer(c(Baseline, Real), 
               names_to = "Lexicon", 
               values_to = "homophones") %>%
  group_by(binned_frequency, language, Lexicon) %>%
  summarise(mean_homophones = mean(homophones),
            se_homophones = sd(homophones) / sqrt(n())) %>%
  ggplot(aes(x = binned_frequency,
             y = mean_homophones,
             color = Lexicon)) +
  geom_point(stat = "identity", size = .5, alpha = .5) +
  geom_hline(yintercept = 0, linetype = "dotted") +
  geom_errorbar(aes(ymin = mean_homophones - 2 * se_homophones, 
                    ymax = mean_homophones + 2 * se_homophones), 
                width=.2) +
  labs(x = "Binned Frequency",
       y = "Number of Homophones") +
  geom_smooth(alpha = .6) +
  theme_bw() +
  facet_wrap(~language) +
  PlotTheme +
  theme(panel.spacing.x = unit(1.5, "lines"))
## `summarise()` regrouping output by 'binned_frequency', 'language' (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggsave("../Figures/direct_comparison_frequency.png", dpi = 300)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
df_all_lexica %>%
  mutate(Baseline = k,
         Real = num_homophones) %>%
  pivot_longer(c(Baseline, Real), 
               names_to = "Lexicon", 
               values_to = "homophones") %>%
  group_by(binned_neighorhood_size, language, Lexicon) %>%
  summarise(mean_homophones = mean(homophones),
            se_homophones = sd(homophones) / sqrt(n())) %>%
  ggplot(aes(x = binned_neighorhood_size,
             y = mean_homophones,
             color = Lexicon)) +
  geom_point(stat = "identity", size = .5, alpha = .5) +
  geom_hline(yintercept = 0, linetype = "dotted") +
  geom_errorbar(aes(ymin = mean_homophones - 2 * se_homophones, 
                    ymax = mean_homophones + 2 * se_homophones), 
                width=.2) +
  labs(x = "Binned Neighborhood Size",
       y = "Number of Homophones") +
  geom_smooth(alpha = .6) +
  theme_bw() +
  facet_wrap(~language) +
  PlotTheme +
  theme(panel.spacing.x = unit(1.5, "lines"))
## `summarise()` regrouping output by 'binned_neighorhood_size', 'language' (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggsave("../Figures/direct_comparison_neighbors.png", dpi = 300)
## Saving 7 x 5 in image
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Visualize coefficients

return_regression_table = function(df) {
  # Build full model
  model_full = lm(data = df,
                  delta ~ frequency + num_sylls_est + normalized_surprisal)
  # Tidy model output
  df_model = broom::tidy(model_full)
  df_model
}

### Get coefficients for each language
english_coefficients = df_merged_english_f %>% 
  return_regression_table() %>%
  mutate(language = 'English')

dutch_coefficients = df_merged_dutch_f %>% 
  return_regression_table() %>%
  mutate(language = 'Dutch')

german_coefficients = df_merged_german_f %>% 
  return_regression_table() %>%
  mutate(language = 'German')

french_coefficients = df_french_f %>% 
  return_regression_table() %>%
  mutate(language = 'French')

japanese_coefficients = df_japanese_f %>% 
  return_regression_table() %>%
  mutate(language = 'Japanese')

mandarin_coefficients = df_mandarin_cld_f %>% 
  return_regression_table() %>%
  mutate(language = 'Mandarin')

# Combine into single dataframe
df_all_coefficients = english_coefficients %>%
  rbind(dutch_coefficients) %>%
  rbind(german_coefficients) %>%
  rbind(french_coefficients) %>%
  rbind(japanese_coefficients) %>%
  rbind(mandarin_coefficients) 


df_all_coefficients$predictor = fct_recode(
  df_all_coefficients$term,
  "Number of Syllables" = "num_sylls_est",
  "Normalized Surprisal" = "normalized_surprisal",
  "Log(Frequency)" = "frequency"
)

### Plots outcome of regression
df_all_coefficients %>%
  filter(predictor != "(Intercept)") %>%
  ggplot(aes(x = language,
             y = estimate)) +
  geom_point() +
  coord_flip() +
  geom_hline(yintercept = 0, linetype = "dotted") +
  geom_errorbar(aes(ymin = estimate - 2*std.error, 
                    ymax = estimate + 2*std.error), 
                width=.2,
                position=position_dodge(.9)) +
  labs(x = "Predictor",
       y = "Estimate") +
  theme_bw() +
  facet_wrap(~predictor) +
  PlotTheme

### Save to figure
ggsave("../Figures/coefficients_faceted_factor.png", dpi = 300)
## Saving 7 x 5 in image

Extension: Neighborhoods

Across all languages tested, it appears that Frequency exhibits a negative relationship with Homophony Delta: more frequent wordforms are less homophonous than one would expect purely on the basis of their phonotactics.

But there are other predictions, also borne out by previous research (Trott & Bergen, 2020), about which areas of the lexicon should see the greatest increase or reduction in homophony. One example is Neighborhood Size. Wordforms with more neighbors might show a corresponding decrease in homophony for one of two reasons. These explanations are not mutually exclusive, though they do suppose distinct causal pathways for the reduction in homophony:

  1. If larger neighborhoods already increase the possibility of perceptual confusion (e.g., "bat" vs. "pat"), a wordform \(w_i\) might have fewer homophones than predicted by its phonotactic probability \(p_i\) when \(w_i\) already has many neighbors.

  2. If languages select against over-saturation (Trott & Bergen, 2020), and if this selection process ultimately results in larger neighborhood sizes, the areas of the lexicon that have undergone the most negative selection relative to their phonotactics should have correspondingly larger neighborhoods.
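As a concrete illustration of the neighborhood measure (a toy sketch with a made-up mini-lexicon, not the pipeline's actual computation), phonological neighbors are standardly defined as wordforms at edit distance 1:

```r
# Toy illustration with hypothetical wordforms: neighborhood size as the
# number of other wordforms at string edit distance 1. base R's adist()
# computes generalized Levenshtein distances.
lexicon = c("bat", "pat", "bad", "bit", "cat", "dog")
neighborhood_size = sapply(lexicon, function(w) sum(adist(w, lexicon) == 1))
neighborhood_size
# "bat" has 4 neighbors here: "pat", "bad", "bit", "cat"
```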

Helper functions

run_regression_with_neighborhood_size = function(df) {
  
  # initialize output
  out = list()
  
  # Log neighborhood
  df$log_neighborhood = log10(df$neighborhood_size + 1)
  
  # Build full model
  model_full = lm(data = df,
                  delta ~ frequency + num_sylls_est + normalized_surprisal + log_neighborhood)

  # Build reduced model
  model_reduced = lm(data = df,
                     delta ~ num_sylls_est + normalized_surprisal + frequency)
  
  # Run model comparison
  comparison = anova(model_reduced, model_full)
  df_comp = broom::tidy(comparison) %>%
    na.omit()
  
  # Tidy model output
  df_model = broom::tidy(model_full)
  
  # Add to list parameters
  out$model = df_model
  out$comparison = df_comp
  
  out
}

plot_neighborhood_bins_residualized = function(df, n) {
  ### Plot residuals of delta ~ #sylls + surprisal + frequency against neighborhood size
  
  # Build reduced model
  model_reduced = lm(data = df,
                     delta ~ num_sylls_est + normalized_surprisal + frequency)
  df$resid = residuals(model_reduced)
  
  # Bin neighborhood size into n quantiles
  df$neighborhood_binned = as.numeric(ntile(df$neighborhood_size, n = n))
  
  g = df %>%
    group_by(neighborhood_binned) %>%
    summarise(mean_delta = mean(resid),
              se_delta = sd(resid) / sqrt(n())) %>%
    ggplot(aes(x = neighborhood_binned,
               y = mean_delta)) +
    geom_point(stat = "identity", size = .2) +
    geom_hline(yintercept = 0, linetype = "dotted") +
    geom_errorbar(aes(ymin = mean_delta - se_delta, 
                      ymax = mean_delta + se_delta), 
                  width=.2,
                  position=position_dodge(.9)) +
    labs(x = "Binned Neighborhood Size",
         y = "Residuals (Reduced Model)") +
    geom_smooth() +
    theme_minimal()

  g
}

plot_outcome = function(df_output) {
  ### Plot regression coefficient estimates with error bars (estimate ± 2 SE)
  
  # Recode term names into readable labels
  df_output$model$predictor = fct_recode(
    df_output$model$term,
    "Number of Syllables" = "num_sylls_est",
    "Normalized Surprisal" = "normalized_surprisal",
    "Log(Frequency)" = "frequency",
    "Log(Neighborhood Size)" = "log_neighborhood"
  )
  
  g = df_output$model %>%
    ggplot(aes(x = predictor,
               y = estimate)) +
    geom_point() +
    coord_flip() +
    geom_hline(yintercept = 0, linetype = "dotted") +
    geom_errorbar(aes(ymin = estimate - 2*std.error, 
                      ymax = estimate + 2*std.error), 
                  width=.2,
                  position=position_dodge(.9)) +
    labs(x = "Predictor",
         y = "Estimate") +
    theme_minimal()
  
  g
}

English

df_merged_english %>%
  plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_merged_english %>%
  plot_neighborhood_bins_residualized(n = 20)
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

output = df_merged_english %>% 
  run_regression_with_neighborhood_size()

output %>%
  plot_outcome()

output$model
## # A tibble: 5 x 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -5.28     0.529      -9.99 1.98e- 23
## 2 frequency              -0.455    0.0938     -4.85 1.25e-  6
## 3 num_sylls_est           0.235    0.121       1.94 5.21e-  2
## 4 normalized_surprisal    3.91     0.171      22.9  1.57e-114
## 5 log_neighborhood       -1.69     0.234      -7.21 5.67e- 13
output$comparison
## # A tibble: 1 x 6
##   res.df      rss    df sumsq statistic  p.value
##    <dbl>    <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  20844 2524551.     1 6301.      52.0 5.67e-13

Dutch

df_merged_dutch %>%
  plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_merged_dutch %>%
  plot_neighborhood_bins_residualized(n = 20)
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

output = df_merged_dutch %>% 
  run_regression_with_neighborhood_size()

output %>%
  plot_outcome()

output$model
## # A tibble: 5 x 5
##   term                 estimate std.error statistic  p.value
##   <chr>                   <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)            -6.43      0.912    -7.05  1.82e-12
## 2 frequency              -3.30      0.183   -18.0   3.74e-72
## 3 num_sylls_est           0.160     0.190     0.838 4.02e- 1
## 4 normalized_surprisal    6.80      0.338    20.1   2.57e-89
## 5 log_neighborhood       -0.797     0.449    -1.78  7.58e- 2
output$comparison
## # A tibble: 1 x 6
##   res.df       rss    df sumsq statistic p.value
##    <dbl>     <dbl> <dbl> <dbl>     <dbl>   <dbl>
## 1  27610 15647403.     1 1786.      3.15  0.0758

German

df_merged_german %>%
  plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_merged_german %>%
  plot_neighborhood_bins_residualized(n = 20)
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

output = df_merged_german %>% 
  run_regression_with_neighborhood_size()

output %>%
  plot_outcome()

output$model
## # A tibble: 5 x 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)           -3.85      0.439      -8.76 2.07e- 18
## 2 frequency             -1.55      0.0928    -16.7  4.98e- 62
## 3 num_sylls_est          0.0842    0.0835      1.01 3.13e-  1
## 4 normalized_surprisal   4.19      0.162      25.9  2.06e-146
## 5 log_neighborhood      -1.67      0.253      -6.61 3.96e- 11
output$comparison
## # A tibble: 1 x 6
##   res.df      rss    df sumsq statistic  p.value
##    <dbl>    <dbl> <dbl> <dbl>     <dbl>    <dbl>
## 1  23807 2739371.     1 5026.      43.7 3.96e-11

French

df_french %>%
  plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_french %>%
  plot_neighborhood_bins_residualized(n = 20)
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

output = df_french %>% 
  run_regression_with_neighborhood_size()

output %>%
  plot_outcome()

output$model
## # A tibble: 5 x 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -2.47     0.294      -8.41 4.12e- 17
## 2 frequency              -0.505    0.105      -4.80 1.58e-  6
## 3 num_sylls_est          -0.143    0.0555     -2.58 1.00e-  2
## 4 normalized_surprisal    3.06     0.108      28.3  2.65e-174
## 5 log_neighborhood       -3.47     0.141     -24.5  5.49e-132
output$comparison
## # A tibble: 1 x 6
##   res.df      rss    df  sumsq statistic   p.value
##    <dbl>    <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1  37273 2644265.     1 42740.      602. 5.49e-132

Japanese

df_japanese_merged %>%
  plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_japanese_merged %>%
  plot_neighborhood_bins_residualized(n = 20)
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

output = df_japanese_merged %>% 
  run_regression_with_neighborhood_size()

output %>%
  plot_outcome()

output$model
## # A tibble: 5 x 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)            -0.280    0.177      -1.58 1.14e-  1
## 2 frequency              -1.26     0.0893    -14.1  6.63e- 45
## 3 num_sylls_est          -0.396    0.0238    -16.7  2.79e- 62
## 4 normalized_surprisal    2.70     0.0843     32.1  5.01e-223
## 5 log_neighborhood       -2.46     0.0740    -33.2  1.07e-238
output$comparison
## # A tibble: 1 x 6
##   res.df      rss    df  sumsq statistic   p.value
##    <dbl>    <dbl> <dbl>  <dbl>     <dbl>     <dbl>
## 1  40444 1172019.     1 31970.     1103. 1.07e-238

Mandarin: Chinese Lexical Database

df_mandarin_merged %>%
  plot_bins(n = 20, column = "neighborhood_size", label = "Binned neighborhood_size")
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

df_mandarin_merged %>%
  plot_neighborhood_bins_residualized(n = 20)
## `summarise()` ungrouping output (override with `.groups` argument)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

output = df_mandarin_merged %>% 
  run_regression_with_neighborhood_size()

output %>%
  plot_outcome()

output$model
## # A tibble: 5 x 5
##   term                 estimate std.error statistic   p.value
##   <chr>                   <dbl>     <dbl>     <dbl>     <dbl>
## 1 (Intercept)             0.284    0.109       2.61 9.18e-  3
## 2 frequency              -0.178    0.0155    -11.5  1.88e- 30
## 3 num_sylls_est          -0.595    0.0259    -23.0  3.83e-116
## 4 normalized_surprisal    2.42     0.0673     35.9  1.17e-278
## 5 log_neighborhood       -2.10     0.0496    -42.4  0.
output$comparison
## # A tibble: 1 x 6
##   res.df     rss    df  sumsq statistic p.value
##    <dbl>   <dbl> <dbl>  <dbl>     <dbl>   <dbl>
## 1  41004 290500.     1 12718.     1795.       0